Abstract: To reduce datacenter energy consumption and cost, current practice has considered demand-proportional resource provisioning schemes, where servers are turned on/off according to the load of requests. Most existing work considers only instantaneous (Internet) requests, which are explicitly or implicitly assumed to be delay-sensitive. On the other hand, datacenters also process a vast number of delay-tolerant jobs, such as background/maintenance jobs. In this paper, we explicitly differentiate delay-sensitive jobs from delay-tolerant jobs. We focus on the problem of using delay-tolerant jobs to fill the extra capacity of datacenters, referred to as trough/valley filling. By giving a higher priority to delay-sensitive jobs, our schemes complement most existing demand-proportional resource provisioning schemes.
Our goal is to design intelligent trough filling mechanisms that 
are energy efficient and also achieve good delay performance. 
Specifically, we propose two joint dynamic speed scaling and 
traffic shifting schemes, one subgradient-based and the other 
queue-based. Our schemes assume little statistical information 
of the system, which is usually difficult to obtain in practice. 
In both schemes, energy cost saving comes from dynamic speed 
scaling, statistical multiplexing, electricity price diversity, and 
service efficiency diversity. In addition, good delay performance is 
achieved in the queue-based scheme via load shifting and capacity 
allocation based on queue conditions. Practical issues that may 
arise in datacenter networks are considered, including capacity 
and bandwidth constraint, service agility constraint, and load 
shifting cost. We use both artificial and real datacenter traces to 
evaluate the proposed schemes. 

I. Introduction 

The fast proliferation of cloud computing has promoted 
rapid growth of large-scale commercial datacenters. Major 
service providers often deploy tens to hundreds of datacen- 
ters distributed nationwide or even worldwide, referred to 
as Internet-scale datacenters (IDC). Because the electricity bill accounts for a large portion of IDC operational expenditure, there have been many efforts toward reducing IDC energy consumption/cost.

Researchers have considered designing 'load-aware' IDCs, e.g., in [1], [2], [4]. The key idea is to provision servers according to the load of Internet requests. Extra servers are shut down or put into sleep mode to save energy. In this paradigm, a major challenge is to properly size an IDC, i.e., to determine the number of active servers while still guaranteeing the service requirement. For example, in [2], the authors propose to predict the load of Windows Live Messenger and provision servers accordingly. In [4], the authors estimate the current load and design online server provisioning schemes to reduce energy and server state transition cost, which is referred to as dynamic "right sizing".

In the above-mentioned work, service requests are typically delay-sensitive, i.e., requiring a short delay and low drop rate. Such applications include web search or signing in to a messenger. When the load is lower, more servers would be turned off to save energy. However, in practice, an IDC operator may be reluctant to turn off servers on a large scale even when the request load is low. One reason is that turning servers on/off frequently affects QoS and long-term system reliability, as considered in [4]. But the foremost reason is that IDCs also have a large number of background or maintenance jobs to process, e.g., a search engine tuning its ranking algorithms. Thus, the "extra" capacity can be utilized to process the background analytical jobs. This is referred to as trough/valley filling.

Trough filling has not been studied thoroughly. In this paper, we focus on intelligent trough filling. We assume a given capacity provisioning and scheduling mechanism for delay-sensitive jobs (DSJs), e.g., one of the load-aware schemes surveyed in Section II. We decide how to use load shifting and dynamic speed scaling to control delay-tolerant jobs (DTJs), e.g., background analytical jobs. On one hand, the DTJ load is high and thus its energy cost is considerable. On the other hand, it is desirable to ensure good delay performance for DTJs. The goal of intelligent trough filling is thus to achieve energy efficiency as well as good delay performance (or at least guarantee queue stability) for DTJs.

Intelligent trough filling needs to accommodate the follow- 
ing issues. First, the overall capacity of a datacenter is likely 
to be random, e.g., due to server failure. Second, capacity 
demand of DSJs, such as Internet requests, varies due to 
dynamic load. Given the higher priority of DSJs, available 
capacity for DTJs is random and hard to predict or learn in 
statistics. Meanwhile, the demand of DTJs is also likely to be 
dynamic. 

Further, in order to consider a set of geographically distributed IDCs, there are additional constraints. First, load shifting is constrained by the bandwidth available between IDCs. In our setting, similar to capacity, bandwidth is prioritized for shifting DSJs, which results in a random 'residual bandwidth' for DTJs. Second, the diversity and dynamics of electricity prices bring challenges as well as opportunities in the context of trough filling, as in price-aware load shifting [32]-[39]. Third, due to heterogeneous service agility, different classes of DTJs may require different sets of IDCs. Moreover, different IDCs may be heterogeneous in service rates and energy consumption for each type of DTJ. We consider these issues and address the above challenges in this paper.

In this paper, our goal is to design intelligent trough filling mechanisms that achieve both energy efficiency and good delay performance. We design joint dynamic speed scaling and load shifting schemes. Specifically, we make the following contributions:

• We focus on trough filling in distributed IDCs, which complements the current work on load-aware capacity provisioning and price-aware load shifting.

• We consider practical issues in IDCs, such as dynamic capacity and bandwidth constraints, dynamic demand, and heterogeneous service agility and service rates.

• We first propose a stochastic subgradient based trough filling scheme, named SSTF, with the objective of minimizing energy and shifting cost while stabilizing the DTJ queues. The proposed algorithm does not need the underlying probability distribution of system states, which is usually difficult to estimate.

• We further propose a queue-based trough filling algorithm, called QTF, which does not need any statistical system information. We show that QTF achieves desirable performance in terms of cost and queue delay.

• We discuss how to incorporate capacity provisioning and QoS assurance for DSJs into our proposed SSTF and QTF.

• We use both a synthetic traffic trace and a real datacenter traffic trace to evaluate our proposed schemes. Simulation results show that QTF outperforms SSTF significantly in both cost and queue delay.

The rest of the paper is organized as follows. In Section II, we survey related work. In Section III, we describe the system model. In Section V, we present the stochastic subgradient based trough filling scheme. We further propose a queue based trough filling scheme in Section VI. We also discuss how to extend the schemes to DSJs, as well as implementation issues, in Section VII. We evaluate our proposed schemes in Section VIII and conclude in Section IX.

II. Related work 

Industry and the academic research community have paid much attention to reducing datacenter energy consumption and cost. Solutions have been considered across the whole spectrum, including power-efficient chips, cooling systems, deployment, and many others.

Our work complements load-aware server provisioning and power-proportional design [1]-[7]. Such works focus on server or resource provisioning based on the load of Internet requests, with the service level agreement (SLA) or other QoS metrics assured. For example, in [1], the authors propose server provisioning and dynamic speed/voltage scaling schemes for a datacenter, through load prediction and feedback control. Load prediction-based server provisioning and load dispatch are proposed in [2] for connection-intensive Microsoft datacenters. Online resource or server provisioning schemes are designed in [3], [4]. In [4], the authors consider a relatively large time interval such that the current load of requests can be estimated. Server state transition cost is also considered. Furthermore, the authors also consider the impact of trough filling on the energy saving of the proposed scheme through simulations. Queue-based server provisioning with performance guarantees established via Lyapunov optimization is proposed in [6]. Although the Lyapunov optimization technique is also used to show the performance of our queue-based scheme, our problem is different, i.e., we consider trough-filling, with cross-datacenter load
shifting and capacity provisioning. In 0, the authors propose 
an economic framework which maximizes the total profit of 
resource provisioning for all requests. 

Many other power management schemes for a datacenter have been proposed, e.g., in [8]-[30]. Dynamic speed/voltage scaling (DVS) reduces the power consumption of a processor by adjusting the frequency based on the instantaneous load demand, e.g., in [8]-[16], which can also be considered as load-aware resource provisioning. However, most of the work only considers a single processor. In [15], the authors use a Markov decision process (MDP) to find the optimal stationary DVS and load balancing policy to reduce service cost. In this paper, we use DVS as a part of the control mechanism for trough filling in IDCs. Another popular scheme is virtualization and server consolidation, e.g., in [20]-[25], which can reduce the traffic dynamics by consolidating applications, and thus reduce the number of active servers. There are also some other works on datacenter-level power management, such as workload decomposition [26], optimal power allocation for servers with a total power budget [27], model predictive control (MPC) theory based hierarchical power control [28], and other techniques [29], [30], [31].

Most recently, cross-IDC power and cost optimization that exploits geographic diversity has received significant attention, e.g., in [32]-[39]. The key idea is to shift requests to IDCs with lower electricity prices to reduce cost. The tradeoff is the extra delay caused by traffic shifting. Thus, in [34], [38], the authors consider response time as a constraint. In [37], [39], the authors consider shifting cost as the revenue loss incurred by extra delay. Our work can also leverage price diversity, i.e.,
by filling cheap troughs of IDCs. The difference is that since 
background jobs are delay tolerant, our capacity provisioning 
and load shifting schemes also exploit the temporal price 
diversity, in addition to geographic diversity. In a recent 
work [51], the authors use energy storage systems to leverage the temporal price dynamics to cut the energy cost, but for a single datacenter.

We refer readers to [43] for a survey and [44] for discussions on challenges and issues in IDC power management.

III. System Models 
A. The IDC and server model 

We consider one service provider with a set of $N$ IDCs in different locations. IDC $i$ has $K_i^{\max}$ homogeneous servers. We consider a time-slotted system, where the slot length can range from hundreds of milliseconds to minutes. We assume that in each slot $t$, the number of active servers of IDC $i$ is fixed and is denoted by $K_i^t$. Note that $K_i^t$ varies over time, due to either dynamic service provisioning (e.g., by the schemes surveyed in Section II) or server failure.

An active server operates at a CPU speed of $s$. Following the models in [11], [12], [36], we normalize $s$, i.e., $0 \le s \le 1$, where 0 represents the idle state of an active server and 1 represents the maximum frequency. We define the capacity of IDC $i$ as the sum of the speeds of all active servers. If each server runs at the same speed $s$, the total capacity in time slot $t$ is $K_i^t s$. Clearly, the maximum capacity with $K_i^t$ servers is $K_i^t$. In this paper, we consider the CPU as the main bottleneck and focus on CPU capacity scheduling. The impact of other equipment, e.g., memory and I/O, is captured by heterogeneous service rates, as discussed in Section III-C. Because scaling the speed $s$ of an active server up or down only takes several microseconds [12], [18], which is negligible, dynamic speed scaling can be conducted instantaneously in each time slot.

B. Workload model 

We consider two categories of demand: delay-sensitive jobs (DSJs), e.g., search, email login, or messenger sign-up, and delay-tolerant jobs (DTJs), e.g., background analytical jobs. DSJs enjoy a higher priority in capacity allocation. The remaining capacity can be utilized by the DTJs. Since the load of DSJs is usually dynamic, the capacity demand of DSJs in IDC $i$ in each slot is considered random. We use $S_{i0}^t$ to denote the capacity allocated to DSJs at IDC $i$ in slot $t$. We assume $S_{i0}^t$ is given, based on some existing load-aware capacity provisioning scheme. The available capacity for DTJs in IDC $i$ is thus $K_i^t - S_{i0}^t$.

DTJs can be further divided into different classes to capture their different resource requirements. We consider a total of $M$ different classes of DTJs in the $N$ IDCs. If the same kind of DTJ, e.g., tuning webpage ranking algorithms, originates (first arrives) at different IDCs, we treat the instances as different classes. This is because they may have different sets of IDCs to be shifted to due to distance constraints. DTJ $j$ first originates at an IDC $i$. Let $D_j^t$ denote the traffic or load size of DTJ $j$ in time slot $t$. $D_j^t$ is a random variable. We do not make assumptions on its distribution.

C. Models for load shifting and service 

Although a DTJ $j$ originates at an IDC $i$, we can shift the traffic to other IDCs, e.g., to exploit their available capacity or lower electricity prices. Note that cross-IDC load shifting is practically feasible due to negligible shifting delay [36], and it has been widely considered, e.g., in [32]-[42]. Load shifting has practical constraints. First, due to the limited service agility of IDCs, a class of DTJ $j$ can potentially be served by only a subset of IDCs. Let $\Gamma_j$ denote the set of IDCs that can serve DTJ $j$, which differs across classes of DTJs. DTJ $j$ can only be shifted to IDC $i'$, where $i' \in \Gamma_j$. Second, bandwidth between IDCs is limited. Moreover, because load shifting for DSJs also requires bandwidth and has a higher priority, the available bandwidth for DTJs is limited and dynamic. This consideration is similar to that in a very recent work [41], where the authors develop a system to rescue unutilized network bandwidth for shifting non-real-time bulk data, e.g., backup data. We use $B_{ii'}^t$ to denote the available bandwidth from IDC $i$ to $i'$ for DTJs in slot $t$. $B_{ii'}^t$ varies over time, and can be set to an appropriate value to prevent significant network delay. Note that when two IDCs have limited connections or are so far apart that load shifting is not desirable, $B_{ii'}^t$ can be set to 0 for all time slots. Let $D_{j,ii'}^t$ denote the traffic of DTJ $j$ shifted from IDC $i$ to $i'$. Further, let $\Gamma_{ii'}$ denote the set of DTJs that first arrive at IDC $i$ and can be served by IDC $i'$. We have $\sum_{j \in \Gamma_{ii'}} D_{j,ii'}^t \le B_{ii'}^t$ as the load shifting constraint.

TABLE I: Main Notations

Symbol            Description
$K_i^t$           Number of active servers of IDC $i$ in slot $t$ ($K_i^\omega$ for state $\omega$)
$D_j^t$           Traffic arrival of DTJ $j$ in slot $t$
$B_{ii'}^t$       Bandwidth constraint for DTJs between IDC $i$ and $i'$ in slot $t$
$\Gamma_{ii'}$    Set of different types of DTJs shifted from IDC $i$ to $i'$
$\Gamma_j$        Set of IDCs that can serve DTJ $j$
$\Pi_i$           Set of different types of DTJs served by IDC $i$
$S_{ij}^t$        Capacity/speed allocated by IDC $i$ ($i \in \Gamma_j$) to DTJ $j$ in slot $t$
$S_{i0}^t$        Capacity/speed allocated by IDC $i$ to DSJs in slot $t$ (given variable)
$S^t$             Capacity/speed matrix in slot $t$ ($S^\omega$ for state $\omega$)
$r_{ij}$          Unit service rate by IDC $i$ for DTJ $j$
$P_i^t$           Power consumption of IDC $i$ in slot $t$
$\alpha_i^t$      Electricity price of IDC $i$ in slot $t$ ($\alpha_i^\omega$ for state $\omega$)
$\phi_{ii'}^t$    Load shifting cost between IDC $i$ and $i'$ in slot $t$
$g^t(\cdot)$      Total cost function on $S^t$ in slot $t$ ($g^\omega(\cdot)$ for state $\omega$)
$\pi_\omega$      Distribution of system state $\omega$ (unknown to SSTF)
DTJ (DSJ)         Delay tolerant (sensitive) jobs

For an IDC $i \in \Gamma_j$, it allocates a certain capacity to DTJ $j$ in time slot $t$, denoted by $S_{ij}^t$. We have $S^t = \{S_{ij}^t \mid j = 1, \ldots, M,\ i \in \Gamma_j\}$ as the capacity allocation matrix, which is our control variable. An IDC $i$ may serve multiple DTJs. Let $\Pi_i$ denote the set of all DTJs served by IDC $i$. We then have the capacity allocation constraint $\sum_{j \in \Pi_i} S_{ij}^t \le K_i^t - S_{i0}^t$.

With capacity $S_{ij}^t$, DTJ $j$ receives a certain service rate. We use $R_{ij}(S_{ij}^t)$ to denote the service rate as a function of the capacity. For simplicity, we consider $R_{ij}(\cdot)$ to be a linear function of $S_{ij}^t$, i.e., $R_{ij}(S_{ij}^t) = r_{ij} S_{ij}^t$. The unit service rate $r_{ij}$ is heterogeneous across pairs of DTJ $j$ and IDC $i$. This is because different DTJs may require different memory, I/O resources, etc. Load shifting and dynamic speed scaling are coupled. The amount of traffic of DTJ $j$ shifted from IDC $i$ to $i'$ depends on the capacity allocated at IDC $i'$. Thus we have $D_{j,ii'}^t \le r_{i'j} S_{i'j}^t$. Since both energy and load shifting cost increase with $S_{i'j}^t$, we have $D_{j,ii'}^t = r_{i'j} S_{i'j}^t$.

The unfinished jobs of a DTJ $j$ are buffered in a queue at the IDC where DTJ $j$ originates. Let $Q_j(t)$ denote the queue length in slot $t$. The queue dynamics of DTJ $j$ can be written as

$$Q_j(t+1) = \max\Big[ Q_j(t) - \sum_{i \in \Gamma_j} r_{ij} S_{ij}^t,\ 0 \Big] + D_j^t, \qquad (1)$$

where $\sum_{i \in \Gamma_j} r_{ij} S_{ij}^t$ is the total service rate that DTJ $j$ receives in slot $t$.
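
As a minimal illustration of the queue dynamics in (1), the following Python sketch steps one DTJ queue through three slots. The service rates, allocations, and arrivals are hypothetical numbers chosen only for this example; the function names are ours and are not part of the proposed schemes.

```python
def total_service(rates, capacities):
    """Total service rate sum_i r_ij * S_ij^t over the IDCs serving DTJ j."""
    return sum(r * s for r, s in zip(rates, capacities))


def queue_update(q, service, arrival):
    """One step of Eq. (1): Q_j(t+1) = max[Q_j(t) - service, 0] + D_j^t."""
    return max(q - service, 0.0) + arrival


# Hypothetical setting: one DTJ class served by two IDCs over three slots.
rates = [1.0, 0.5]                                   # r_ij for the two IDCs
allocations = [[2.0, 2.0], [0.0, 1.0], [4.0, 4.0]]   # S_ij^t in each slot
arrivals = [5.0, 1.0, 0.0]                           # D_j^t in each slot

q = 0.0
history = []
for caps, d in zip(allocations, arrivals):
    q = queue_update(q, total_service(rates, caps), d)
    history.append(q)

print(history)
```

Note that arrivals in slot $t$ are added after the service of slot $t$, so they can be served in slot $t+1$ at the earliest.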

D. Power consumption and cost model 

According to [11], [12], the power consumption of a server (processor) running at a speed $s \in [0, 1]$ is

$$P(s) = p s^v + 1 - p, \qquad (2)$$

where the exponent $v > 1$, with a typical value of 2 [12], and $1 - p$ represents the power consumption in the idle state, which is around 0.6 and hardly lower than 0.5 [12]. In this paper, we choose $v = 2$, as in [12]. Note that our schemes can be extended to cases with other values of $v$.

Consider an IDC $i$. In a time slot $t$, there are $K_i^t$ active servers, and the total capacity demand is $S_i^t$. It can be shown that the most energy-efficient operation is to let each server evenly share the demand, i.e., each server runs at a speed of $S_i^t / K_i^t$, which results in a total power consumption in slot $t$ of

$$P_i^t = (1 - p) K_i^t + \frac{p\,(S_i^t)^2}{K_i^t}, \qquad (3)$$

where $S_i^t = S_{i0}^t + \sum_{j \in \Pi_i} S_{ij}^t$. Because we focus on trough-filling, we take $K_i^t$ and $S_{i0}^t$ as given constants in each time slot. We only control $S_{ij}^t$. Note that $P_i^t$ is a convex function of $S_i^t$.
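
The power model in (2) and (3) can be checked numerically. The sketch below assumes $v = 2$ and $p = 0.4$ (so that the idle power $1 - p$ is 0.6, in line with the typical value quoted above); the demand and server counts are hypothetical. It compares even sharing of the demand with one uneven split of the same demand.

```python
def server_power(s, p=0.4, v=2.0):
    """Eq. (2): power of one active server at normalized speed s in [0, 1]."""
    return p * s ** v + 1.0 - p


def idc_power(total_capacity, k_active, p=0.4):
    """Eq. (3) with v = 2: all k_active servers share the demand evenly."""
    return (1.0 - p) * k_active + p * total_capacity ** 2 / k_active


def uneven_power(speeds, p=0.4):
    """Total power when each active server runs at its own speed."""
    return sum(server_power(s, p) for s in speeds)


# Hypothetical IDC: 4 active servers, total capacity demand of 2.
even = idc_power(total_capacity=2.0, k_active=4)   # every server at speed 0.5
uneven = uneven_power([1.0, 1.0, 0.0, 0.0])        # same demand on two servers

print(even, uneven)
```

With these numbers, even sharing consumes less power than the uneven split, which reflects the convexity of $P(s)$ in $s$.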

Besides the power consumption of servers, other components in an IDC, e.g., memory, I/O, hard disks, and non-IT equipment such as cooling systems, also contribute to the total power consumption, which is roughly proportional to that of the servers [46]. Thus the total power consumption of an IDC can be obtained by scaling up $P_i^t$ by a constant factor. For notational brevity, we absorb this constant factor into the electricity price at IDC $i$. Electricity prices exhibit significant diversity in both location and time. We use $\alpha_i^t$ to denote the price at IDC $i$ in time slot $t$. Although $\alpha_i^t$ is time-varying, it varies slowly. Typically, in a wholesale market, $\alpha_i^t$ is determined by the Regional Transmission Organization (RTO) a day ahead based on the expected load and changes hourly; alternatively, $\alpha_i^t$ is determined in real time (every 15 minutes) based on the actual load. We consider the energy cost of an IDC as the product of its power consumption and its electricity price.

E. Load shifting cost 

We also consider load shifting cost. In practice, datacenter 
operators may have a lease with ISPs for data traffic among 
IDCs. Some large operators like Google and Microsoft may 
even have their own backbone networks to interconnect the 
IDCs. Either case, shifting cost is usually incurred during the 
acquisition or construction phase, which depends less on the 
traffic volume that the internal links carry B31 . However, since 
DTJs have a lower priority, it is desirable to schedule a limited 
link bandwidth to them. For example, when the time slot is 
relatively long, a higher utilization of the link capacity by 
DTJs will make the system more sensitive to the burst of DSJs, 
which enjoy a higher priority in load shifting. To limit this sensitivity, we use a piecewise-linear cost function with increasing slopes to model the shifting cost for DTJs. Let $\phi_{ii'}^t$ denote the shifting cost in slot $t$ between IDC $i$ and $i'$. We have

$$\phi_{ii'}^t = \max_{\theta} \Big\{ a_{ii'}^{\theta}\, \frac{\sum_{j \in \Gamma_{ii'}} D_{j,ii'}^t}{B_{ii'}^t} + b_{ii'}^{\theta} \Big\}, \quad \theta \in \{1, 2, \ldots, \Theta\}, \qquad (4)$$

where $\sum_{j \in \Gamma_{ii'}} D_{j,ii'}^t / B_{ii'}^t$ is the link capacity occupation ratio by DTJs. We have $a_{ii'}^1 < \cdots < a_{ii'}^{\theta} < \cdots < a_{ii'}^{\Theta}$, which captures the increasing sensitivity to the capacity occupation ratio by DTJs. $\phi_{ii'}^t$ is a convex function of $D_{j,ii'}^t$, and thus of $S^t$, since it is the pointwise maximum of a set of affine functions and $D_{j,ii'}^t$ is linear in $S^t$. This model is also widely considered in previous work, e.g., in [47]. Note that our work can also incorporate other shifting cost models with minor modifications.
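
A piecewise-linear shifting cost of this kind can be evaluated as the pointwise maximum of affine segments. The Python sketch below uses a hypothetical three-segment cost whose slopes and intercepts are illustrative values of ours (chosen so that the segments meet at occupation ratios 0.5 and 0.8); they are not parameters taken from the evaluation.

```python
def shifting_cost(shifted, bandwidth, slopes, intercepts):
    """Pointwise maximum of affine segments a_k * ratio + b_k, where ratio is
    the share of the available link bandwidth occupied by DTJ traffic."""
    ratio = shifted / bandwidth
    return max(a * ratio + b for a, b in zip(slopes, intercepts))


# Hypothetical segments with increasing slopes (breakpoints at 0.5 and 0.8).
slopes = [1.0, 2.0, 4.0]
intercepts = [0.0, -0.5, -2.1]

low = shifting_cost(5.0, 20.0, slopes, intercepts)    # occupation ratio 0.25
mid = shifting_cost(12.0, 20.0, slopes, intercepts)   # occupation ratio 0.60
full = shifting_cost(20.0, 20.0, slopes, intercepts)  # occupation ratio 1.00

print(low, mid, full)
```

Because the slopes increase, each additional unit of shifted DTJ traffic costs more as the link fills up, which discourages DTJs from crowding out bursts of DSJ traffic.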

IV. A BENCHMARK SCHEME 

In this section, we first consider a benchmark scheme, where the goal is to minimize the time average of the total cost of the $N$ IDCs, including energy cost and shifting cost, while stabilizing the $M$ DTJ queues. We name it stability-assured cost-optimal trough filling (SCOTF). In each time slot, both the energy cost and the shifting cost are functions of $S^t$. The overall cost in each slot also depends on $K_i^t$, $\alpha_i^t$, and $S_{i0}^t$, $i = 1, \ldots, N$. Thus the overall cost is a time-varying function of $S^t$, denoted by $g^t(S^t)$. Besides, the capacity allocation and shifting constraints, which depend on $K_i^t - S_{i0}^t$ and $B_{ii'}^t$, are also time-varying. Thus $S^t$ takes values in a time-varying set. Let $A^t$ denote the set of $S^t$ that satisfy the capacity allocation and shifting constraints in slot $t$. SCOTF is formulated as

$$\min_{S^t}\ \liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} g^t(S^t)$$
$$\text{s.t.}\ \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} Q_j(t) < \infty, \quad j = 1, \ldots, M, \qquad (5)$$
$$S^t \in A^t, \qquad (6)$$

where the first constraint guarantees the stability of each DTJ queue. Note that we use 'sup' ('inf') to guarantee that the limit exists.

It is difficult to solve problem (6) directly in practice, because it is hard to obtain prior system information for all time slots. We present SCOTF here as a cost benchmark. Our proposed schemes, one stochastic subgradient-based and one queue-based, require little statistical information about the system, and thus are more practical. The objectives of the proposed schemes are not limited to guaranteeing DTJ queue stability as in SCOTF. Good delay performance is also desired, especially for the queue-based scheme.

V. Stochastic subgradient based trough filling 

We first consider an ergodic scenario where the system state has a steady-state distribution. Here a state characterizes a unique set of all variables involved in the system, including $K_i^t$, $\alpha_i^t$, $S_{i0}^t$, and $B_{ii'}^t$, $i, i' \in \{1, \ldots, N\}$. Let $\Omega$ denote the set of system states, $\omega \in \Omega$ a generic system state, $\pi_\omega$ the steady-state probability of $\omega$, and $g^\omega(\cdot)$ the cost function in state $\omega$. Let $S^\omega$ denote the capacity allocation matrix in state $\omega$, which is in the set $A^\omega$. Let $\lambda = (\lambda_1, \ldots, \lambda_M)$ denote the mean arrival rate vector of DTJs. SCOTF can be rewritten as

$$\min\ g_e = \sum_{\omega \in \Omega} \pi_\omega\, g^\omega(S^\omega)$$
$$\text{s.t.}\ \sum_{\omega \in \Omega} \pi_\omega\, R(S^\omega) \ge \lambda,$$
$$S^\omega \in A^\omega, \qquad (7)$$

where $R(S^\omega)$ is the vector of total service rates, whose $j$th entry is $\sum_{i \in \Gamma_j} r_{ij} S_{ij}^\omega$. We use $g^*$ to denote the optimal value of the above problem, i.e., the optimal cost in the ergodic case with the arrival rate $\lambda$ stabilized. In practice, $\lambda$ can possibly be estimated from historical data or by prediction schemes. If the steady-state distribution $\pi_\omega$ is available, then (7) is a deterministic convex optimization problem. However, in practice it may be difficult to obtain such statistical knowledge. We thus design a stochastic subgradient-based algorithm that can solve (7) without prior information on $\pi_\omega$. Note that the scheme needs the average arrival rate $\lambda$, or at least an upper bound on it, to guarantee stability.

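We first define a Lagrangian function associated with problem (7) as

$$L(\mu, S) = \sum_{\omega \in \Omega} \pi_\omega\, g^\omega(S^\omega) - \sum_{j=1}^{M} \mu_j \Big( \sum_{\omega \in \Omega} \pi_\omega \sum_{i \in \Gamma_j} r_{ij} S_{ij}^\omega - \lambda_j \Big), \qquad (8)$$

where $S = \{S^\omega \mid \omega \in \Omega\}$, $S^\omega \in A^\omega$, and $\mu = (\mu_1, \ldots, \mu_M)$ is the vector of Lagrangian multipliers. Note that $\mu \ge 0$. The dual problem of (7) is defined as

$$\max_{\mu \ge 0}\ F(\mu), \qquad (9)$$

where

$$F(\mu) = \min_{S}\ L(\mu, S). \qquad (10)$$

To solve the dual problem, we first consider (10). For a given multiplier $\mu$, the problem is separable across states. Thus, we can solve the following problem for a given state $\omega$:

$$\min_{S^\omega}\ g^\omega(S^\omega) - \sum_{j=1}^{M} \mu_j \Big( \sum_{i \in \Gamma_j} r_{ij} S_{ij}^\omega - \lambda_j \Big) \qquad (11)$$
$$\text{s.t.}\ S^\omega \in A^\omega.$$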

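An examination of (11) yields the following optimization problem of joint capacity allocation and load shifting after observing the system state in the current slot:

$$\min_{S^\omega}\ \sum_{i=1}^{N} \alpha_i^\omega \Big[ (1 - p) K_i^\omega + \frac{p\,\big(S_{i0}^\omega + \sum_{j \in \Pi_i} S_{ij}^\omega\big)^2}{K_i^\omega} \Big] + \sum_{i=1}^{N} \sum_{i' \ne i} \max_{1 \le \theta \le \Theta} \Big\{ a_{ii'}^{\theta}\, \frac{\sum_{j \in \Gamma_{ii'}} r_{i'j} S_{i'j}^\omega}{B_{ii'}^\omega} + b_{ii'}^{\theta} \Big\} - \sum_{j=1}^{M} \mu_j \sum_{i \in \Gamma_j} r_{ij} S_{ij}^\omega$$
$$\text{s.t.}\ \sum_{j \in \Pi_i} S_{ij}^\omega \le K_i^\omega - S_{i0}^\omega, \quad i = 1, \ldots, N,$$
$$\sum_{j \in \Gamma_{ii'}} r_{i'j} S_{i'j}^\omega \le B_{ii'}^\omega, \quad i, i' = 1, \ldots, N,\ i \ne i'. \qquad (12)$$

In (12), the first term is the total energy cost and the second is the shifting cost; the remaining term is the multiplier-weighted service rate carried over from (11). The first constraint is the capacity constraint on DTJs in IDC $i$, and the second constraint is the bandwidth constraint between IDCs $i$ and $i'$. Clearly, (12) is a convex optimization problem in $S^\omega$. This is because the objective function is the sum of a set of convex and affine functions of $S^\omega$, and the constraints are both affine and thus convex. We can solve it efficiently for a given state $\omega$ in each time slot. When the capacity allocation is determined, the load shifting policy is also jointly determined, i.e., shift an amount of $r_{i'j} S_{i'j}^\omega$ of DTJ $j$ from IDC $i$ to $i'$ if $j \in \Gamma_{ii'}$.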

The dual problem can be solved using a stochastic subgradient algorithm [49], which has the following iterative step:

$$\mu^{n+1} = \big[ \mu^n + \beta^n \sigma^n \big]^+, \qquad (13)$$

where $n$ denotes the $n$th iteration, i.e., the $n$th time slot in our case, $\beta^n$ is the step size, $[\cdot]^+$ denotes projection onto the nonnegative orthant, and $\sigma^n = (\sigma_1^n, \ldots, \sigma_M^n)$ is the vector of stochastic subgradients, chosen such that

$$E(\sigma^n \mid \mu^0, \ldots, \mu^n) = \partial_\mu F(\mu^n), \qquad (14)$$

where $\partial_\mu F(\mu^n)$ is a subgradient of $F(\mu)$ at $\mu^n$. In this case, by updating $\mu^n$ using (13), $\mu^n$ converges to the optimal solution of the dual problem (9) with probability 1, if the following conditions are satisfied:

$$E(\|\sigma^n\|^2 \mid \mu^0, \ldots, \mu^n) \le c, \qquad (15)$$

where $c$ is a constant, and $\sum_{n} \beta^n = \infty$, $\sum_{n} (\beta^n)^2 < \infty$. Note that a candidate for $\beta^n$ is $1/n$.

The subgradient $\partial_\mu F(\mu)$ can be a set, where by Danskin's Theorem [50], we can choose a subgradient as

$$\partial_{\mu_j} F(\mu) = -\sum_{\omega \in \Omega} \pi_\omega \sum_{i \in \Gamma_j} r_{ij} S_{ij}^{\omega *} + \lambda_j, \quad j = 1, \ldots, M, \qquad (16)$$

where $S_{ij}^{\omega *}$ is the optimal solution to problem (12). Note that $\sigma_j^n$ is a stochastic subgradient if its expectation equals a subgradient. We can choose $\sigma_j^n$ as

$$\sigma_j^n = -\sum_{i \in \Gamma_j} r_{ij} S_{ij}^{\omega^n *} + \lambda_j, \quad j = 1, \ldots, M, \qquad (17)$$

where $\omega^n$ is the index of the system state at iteration $n$. Condition (15) is satisfied because $r_{ij} S_{ij}^{\omega}$ is bounded for all $i, j$, which leads to bounded $\sigma_j^n$ for all $j$. $\sigma^n$ defined in (17) is a stochastic subgradient, because we consider an ergodic setting and thus the time average of $\sigma^n$ equals the subgradient in (16). Further, since the original problem (7) is a convex optimization problem that satisfies Slater's condition, there is no duality gap.

We name the above algorithm stochastic subgradient-based trough filling (SSTF). SSTF converges to the optimal solution of problem (7). Thus it can achieve the optimal cost given a service rate that assures queue stability. Note that SSTF can work in non-ergodic settings. The Lagrangian multiplier $\mu$ has a practical interpretation. It can be considered as a price, which increases when the service rate is smaller than the average arrival rate, i.e., under capacity under-provisioning. In practice, by updating $\mu$, SSTF can achieve good cost performance. Moreover, the objective of SSTF is not limited to cost optimality. One can tune the average service rate of SSTF, i.e., by adjusting $\lambda$ in (7), to control the DTJ delay. Thus, SSTF is not simply SCOTF in the ergodic setting. Another benefit of SSTF is that it also exploits the temporal diversity of electricity prices. However, SSTF needs knowledge of the average DTJ arrival rate, which may not be available in practice. Further, it may converge slowly, and it is difficult to characterize its delay performance. This motivates us to consider the following queue-based algorithm, which leverages queue information so that neither $\lambda$ nor system distribution information is required.
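
To make the SSTF iteration concrete, the following Python sketch runs the multiplier update on a deliberately simplified toy instance. It is an illustration under our own assumptions, not the full scheme: there is a single IDC, a single DTJ class, and no load shifting, so the per-slot problem reduces to minimizing $\alpha[(1-p)K + p(S_0 + S)^2/K] - \mu(rS - \lambda)$ over $0 \le S \le K - S_0$, which has the clipped closed-form minimizer used in `allocate`. The state distribution, the parameter values, and the function names are hypothetical.

```python
import random

P = 0.4      # dynamic-power coefficient p in Eq. (2); idle power is 1 - p
R = 1.0      # unit service rate r (single IDC, single DTJ class)
LAM = 4.0    # mean DTJ arrival rate lambda, assumed known to SSTF


def allocate(mu, k, alpha, s0):
    """Per-slot minimizer of alpha * power - mu * (R * S - LAM), 0 <= S <= k - s0."""
    unconstrained = mu * R * k / (2.0 * alpha * P) - s0
    return min(max(unconstrained, 0.0), k - s0)


def sstf(num_slots, seed=0):
    """Run the projected multiplier update with step size 1/n."""
    rng = random.Random(seed)
    mu, served = 0.0, []
    for n in range(1, num_slots + 1):
        # Observe the state of the current slot (hypothetical i.i.d. states).
        k = 10.0
        alpha = rng.choice([1.0, 2.0])   # electricity price
        s0 = rng.choice([2.0, 4.0])      # capacity taken by DSJs
        s = allocate(mu, k, alpha, s0)
        sigma = LAM - R * s              # stochastic subgradient, as in (17)
        mu = max(mu + sigma / n, 0.0)    # projected update, as in (13)
        served.append(R * s)
    return mu, served


mu, served = sstf(20000)
tail = served[len(served) // 2:]
avg_service = sum(tail) / len(tail)
print(mu, avg_service)
```

In this toy instance the multiplier at which the expected service rate equals $\lambda$ can be computed by hand as $7/9.375 \approx 0.747$, and the iterate is expected to settle near that value while the average service rate approaches $\lambda$. The price interpretation is visible in the update: the multiplier rises in slots where the served amount falls short of $\lambda$ and falls otherwise.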



VI. Queue based trough filling 

A. Algorithm Design 

In this section, we present a queue-based algorithm that explicitly considers the queue backlog of DTJs. The algorithm takes the instantaneous system state (i.e., queue length, available server capacity and bandwidth, DSJ load demand) as the input. The algorithm also has a parameter to control the tradeoff between cost and queue delay. We will also show that the algorithm achieves a bounded average queue backlog such that the system is stabilized, while the cost can be arbitrarily close to the optimal cost achieved by (7).

In each time slot $t$, observe the current queue backlog $Q_j(t)$, $j = 1, \ldots, M$, and $\alpha_i^t$, $S_{i0}^t$, $K_i^t$, and $B_{ii'}^t$, $i = 1, \ldots, N$. Allocate the capacity at each IDC $i$ for each queue $j$ according to the following optimization scheme, named queue-based trough filling (QTF):

Similar to (12), (19) is a convex optimization problem. Thus at the beginning of each slot, the capacity allocation $S^t$ can be determined efficiently.

The intuition of QTF is clear. When the total queue length $\sum_{j=1}^{M} Q_j(t)$ is high, QTF has an incentive to allocate a larger capacity to reduce the queue length. When the cost is relatively large or the queue length is small, QTF is driven to allocate less capacity to reduce the cost. The control parameter $V$ balances the queue length and the cost. If $V$ is large, QTF will result in a lower cost but a longer average queue delay.

To better illustrate the intuition of the algorithm, we further consider a special case, where there is only one IDC with $M$ delay-tolerant queues. In the single-IDC case, we can simplify the notation by removing the subscript $i$. The capacity vector becomes $S^t = \{S_1^t, \ldots, S_M^t\}$. We have the following scheme for capacity allocation, named single-IDC queue-based trough filling (SQTF):
(22)
We have the following solution for $S^t$.

Observation 1: SQTF allocates $S^t$ as follows: in each time slot $t$, choose the queue with the maximum $Q_j(t) r_j$, denoted as $j^*$, then

In other words, SQTF is a threshold-based policy, which serves the queue with the largest rate-weighted backlog, and only when that backlog is above a certain threshold.
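
The threshold behavior can be illustrated with a short Python sketch. Since the SQTF problem itself is not reproduced in the text above, the sketch rests on an assumption of ours: the per-slot objective has the usual drift-plus-penalty form $V \alpha [(1-p)K + p(S_0 + \sum_j S_j)^2 / K] - \sum_j Q_j(t) r_j S_j$, subject to $S_j \ge 0$ and $\sum_j S_j \le K - S_0$. Under this assumed form, all DTJ capacity goes to the queue with the largest $Q_j(t) r_j$, and the amount is a clipped linear function of that weight. The function name and all numbers are hypothetical.

```python
def sqtf(queues, rates, k, s0, alpha, v, p=0.4):
    """Single-IDC allocation under the ASSUMED objective
    V * alpha * [(1 - p) * k + p * (s0 + sum(S))**2 / k] - sum(Q_j * r_j * S_j).
    Returns the capacity S_j allocated to each DTJ queue."""
    weights = [q * r for q, r in zip(queues, rates)]
    # Energy depends only on the total allocation, so only the queue with the
    # largest weight Q_j * r_j is served.
    j_star = max(range(len(weights)), key=weights.__getitem__)
    # Stationary point of the assumed objective, clipped to the feasible range.
    unconstrained = k * weights[j_star] / (2.0 * v * alpha * p) - s0
    alloc = [0.0] * len(queues)
    alloc[j_star] = min(max(unconstrained, 0.0), k - s0)
    return alloc


# Hypothetical slot: 10 active servers, DSJs take 2 units, unit price.
small_v = sqtf([3.0, 10.0], [2.0, 1.0], k=10.0, s0=2.0, alpha=1.0, v=5.0)
large_v = sqtf([3.0, 10.0], [2.0, 1.0], k=10.0, s0=2.0, alpha=1.0, v=50.0)
idle = sqtf([0.5, 1.0], [2.0, 1.0], k=10.0, s0=2.0, alpha=1.0, v=50.0)

print(small_v, large_v, idle)
```

Under the assumed objective, a queue is served only when $Q_{j^*}(t) r_{j^*}$ exceeds $2 V \alpha p S_0 / K$, so a larger $V$ raises the threshold and trades longer queues for lower energy cost.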

B. Performance analysis 

In this subsection, we analyze the performance of the QTF 
algorithm in terms of the cost and average delay performance. 
Our analysis is based on Lyapunov drift optimization BP . 

Define $r_i = \max\{r_{ij} \mid j \in \Pi_i\}$, i.e., the maximum unit service rate over all DTJs in IDC $i$. Let $D_j^{\max}$ denote the upper bound on the arrival traffic size of DTJ $j$ in each slot. We have the following proposition.

Proposition 1: Assuming that the traffic of DTJs is i.i.d. across slots with mean $\lambda$, the QTF algorithm stabilizes the system for a given parameter $V$. In addition, an upper bound on the average queue length is



lim 



, T M 



(25) 

Further, the average cost achieved by QTF, denoted by $g_q^t(S^t)$ in each slot $t$, is upper bounded as

~2 Tsmax'Z 
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(26) 

where g* is the optimal solution to problem (0, and e is 
a positive value, g*(e) is the optimal solution to with X 
replaced by X + le. 

Proof: In the Appendix. 
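
To make the role of $V$ in Proposition 1 concrete, the following Python sketch evaluates two bounds of the form $(B + V g^*(\epsilon)) / (2\epsilon)$ for the average backlog and $g^* + B/V$ for the average cost, which is the form the derivation in the Appendix leads to. The function name and all numerical inputs are hypothetical; the sketch only shows that the backlog bound grows linearly in $V$ while the cost gap shrinks as $1/V$.

```python
from typing import Tuple


def qtf_bounds(B: float, V: float, eps: float,
               g_star: float, g_star_eps: float) -> Tuple[float, float]:
    """Return (backlog_bound, cost_bound) for given constants.

    B, eps, g_star and g_star_eps are placeholders for the quantities
    B, epsilon, g* and g*(epsilon) in Proposition 1.
    """
    backlog_bound = (B + V * g_star_eps) / (2.0 * eps)
    cost_bound = g_star + B / V
    return backlog_bound, cost_bound
```

For example, with the placeholder values `B=100`, `eps=0.5`, `g_star=10` and `g_star_eps=12`, moving from `V=1` to `V=1000` tightens the cost bound toward `g_star` and loosens the backlog bound.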

VII. Discussions 

A. Joint DSJ and DTJ design 

Although SSTF and QTF are both proposed for trough-filling, with
some modifications they can be used for joint DSJ and DTJ capacity
provisioning. First, $S_{i0}^t$, for DSJs, will become a part of the
decision variables, together with $S_{ij}^t$ for DTJs. An important
issue is how to guarantee service requirements for DSJs.



For SSTF, we can simply introduce a QoS constraint for DSJs. For
example, if the slot length is large, i.e., tens of seconds to
minutes, following [4], we can estimate the mean DSJ rate for IDC $i$
at the beginning of the current slot, denoted by $\lambda_i^t$. Note
that $\lambda_i^t$ may incorporate traffic from other IDCs due to
certain traffic shifting schemes. Let $r_{i0}$ denote the unit
service rate for DSJs in IDC $i$. Following [34], a delay constraint
can be imposed, e.g., $\frac{1}{r_{i0} S_{i0}^t - \lambda_i^t} \le \delta$,
which is equivalent to the linear constraint
$r_{i0} S_{i0}^t \ge \lambda_i^t + 1/\delta$ on $S_{i0}^t$, and thus
can be easily incorporated into our convex optimization problem.
When the time slot length is small, such as hundreds of
milliseconds, it is difficult to estimate the mean of DSJ traffic in
the current slot. In this case, one may assume that DSJ traffic
follows a certain distribution based on past measurements, and
define the outage probability as a QoS constraint, that is, the
probability that the DSJ load in IDC $i$, i.e., $D_{i0}^t$, exceeds
the capacity $S_{i0}^t$. The DSJ QoS constraint can be expressed as
$\Pr(D_{i0}^t > S_{i0}^t) \le \delta_i$. Based on knowledge of the
traffic distribution, e.g., a Gaussian or exponential distribution,
one can rewrite the constraint as a convex function of $S_{i0}^t$.
Since the time slot length is small, the outage probability can be
easily measured. Adjusting $S_{i0}^t$ is probably necessary to
eliminate the discrepancy between the real distribution of
$D_{i0}^t$ and the assumed one, using stochastic approximation
schemes.
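
The two kinds of DSJ constraints discussed above can each be turned into a minimum capacity. The Python sketch below does this under two stated assumptions that go beyond the text: an M/M/1-style delay constraint of the form $1/(rS - \lambda) \le \delta$, and a Gaussian DSJ load for the outage constraint. The function names and the example numbers are hypothetical.

```python
from statistics import NormalDist


def min_capacity_delay(arrival_rate: float, unit_rate: float,
                       delay_bound: float) -> float:
    """Smallest S with 1 / (unit_rate * S - arrival_rate) <= delay_bound.

    Rearranged into the linear form unit_rate * S >= arrival_rate + 1/delay_bound.
    """
    return (arrival_rate + 1.0 / delay_bound) / unit_rate


def min_capacity_outage(mean_load: float, std_load: float,
                        outage_prob: float) -> float:
    """Smallest S with Pr(D > S) <= outage_prob, assuming D is Gaussian."""
    return NormalDist(mean_load, std_load).inv_cdf(1.0 - outage_prob)
```

Both results are lower bounds on the DSJ capacity; with a different assumed distribution, only `min_capacity_outage` would need to change.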

Similar approaches can be applied to extend QTF. For example, one
can use the outage probability as a DSJ QoS constraint. Let
$\delta_i$ denote the outage probability constraint. To enforce it,
we can design a virtual outage queue. Let $I_i(\cdot)$ be an
indicator function: $I_i(t) = 1$ if there is an outage in slot $t$,
i.e., $D_{i0}^t > S_{i0}^t$, and $I_i(t) = 0$ otherwise. We use
$O_i(t)$ to denote the virtual outage queue backlog in slot $t$,
which updates as $O_i(t+1) = \max\{O_i(t) - \delta_i, 0\} + I_i(t)$.
It can be shown that the virtual queue is stable if
$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} I_i(t) \le \delta_i$,
i.e., the outage probability constraint is satisfied. Note that
$\delta_i$ can be considered as the service rate of the virtual
queue. Using the virtual outage queue, we can modify QTF to provide
capacity provisioning for DSJs. Further investigation of the joint
design of capacity provisioning and QoS assurance for DSJs, and
trough-filling, is left as future work.
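
The virtual outage queue update can be traced with a few lines of Python. This is only a sketch of the recursion stated above; the function name and the example inputs are hypothetical, and the queue is assumed to start empty.

```python
from typing import Iterable, List


def virtual_outage_queue(loads: Iterable[float],
                         capacities: Iterable[float],
                         delta: float) -> List[float]:
    """Trace O(t+1) = max(O(t) - delta, 0) + I(t), starting from O = 0."""
    backlog = 0.0
    trace = []
    for load, cap in zip(loads, capacities):
        outage = 1.0 if load > cap else 0.0  # indicator I(t)
        backlog = max(backlog - delta, 0.0) + outage
        trace.append(backlog)
    return trace
```

A backlog that keeps growing signals that outages occur more often than the target `delta` allows, which is the information a modified QTF could use when provisioning capacity for DSJs.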

B. Implementation issues and caveats 

In our schemes, the decision-maker needs to gather the input at the
beginning of each slot. The messaging delay is about tens of
milliseconds [36], and each IDC only has a few parameters to send to
the decision-maker. Each time slot can last from several seconds to
several minutes. Thus the messaging overhead is negligible. The
decision overhead is also negligible since the convex optimization
problems can be solved efficiently. Load shifting overhead, i.e.,
network delay, can be easily constrained by controlling the
bandwidth for DTJs.

In this paper, we consider homogeneous servers for simplicity.
However, in an IDC, servers may differ in terms of power
consumption, maximum speed, and memory. To apply our schemes, we can
further classify the servers into different units, where homogeneous
or similar servers belong to one unit. The input is then no longer
IDC-based, but unit-based. In practice, we can simply classify
servers according to their ages. Typically, there are three
stock-keeping units (SKUs) in an IDC, i.e., latest, one-year-old,
and two-year-old.

Fig. 1: Delay and cost of different schemes with different ratios
between the load of DTJ and DSJ.

In practice, some DTJs may need to be finished by a deadline, and
different classes of DTJs may have different deadlines. Designing
energy-efficient DTJ scheduling algorithms with heterogeneous
deadlines for IDCs is an interesting open problem, which we will
consider in the future.

In this paper, we mainly focus on CPU-intensive DSJs. We will also
extend our work to I/O-intensive DTJs. Besides, we will explicitly
consider the effect of virtualization, under which the performance
versus power curve may become more difficult to quantify [22], [36].

VIII. Performance evaluation 

In this section, we evaluate the performance of SSTF and 
QTF, using both synthetic and real traces. 

A. Synthetic traces based simulation 

1) Simulation setup: We consider five IDCs in different locations.
There are ten DTJ queues in total, each originating at a randomly
chosen one of the five IDCs. The IDC set $\Gamma_j$ that can serve
DTJ $j$ is chosen randomly. Idle power consumption $1 - \rho$ is set
to 0.5. To create an ergodic setting, we define 100 states, each
with a different total capacity, load shifting constraint, DSJ
demand, and electricity price. The capacity of each IDC is uniformly
distributed from 10k to 15k. The load shifting constraint is
uniformly distributed from 3000 to 4000. Load shifting cost
parameters are set the same as in [47]. The electricity price is
uniformly distributed from 1 to 10. DSJ demand, set as a ratio of
the total capacity, is randomly distributed from 0 to 0.4. Thus the
average DSJ demand is about 20% of the total capacity. We consider
different ratios between the load of DTJ and DSJ by setting
different average arrival rates of DTJs. The ratios are 0.5, 1, 1.5,
2, 2.5, 3, and 3.5, respectively. Thus the percentage of DTJ demand
in the total capacity ranges from 10% to 70%. We simulate 100k time
slots in each of the 30 simulation settings. In each time slot, a
system state is chosen randomly according to a predefined
probability.
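
A setup of this kind could be generated along the following lines. This Python sketch is not the authors' simulator: the dictionary layout, the function names, the fixed seed, and the default of uniform state probabilities are all assumptions made for illustration, and the load shifting cost parameters are omitted.

```python
import random
from typing import Dict, List, Optional


def make_states(num_states: int = 100, num_idcs: int = 5,
                seed: int = 0) -> List[Dict[str, List[float]]]:
    """Draw per-IDC parameters for each system state."""
    rng = random.Random(seed)
    states = []
    for _ in range(num_states):
        capacity = [rng.uniform(10_000, 15_000) for _ in range(num_idcs)]
        states.append({
            "capacity": capacity,
            "shift_limit": [rng.uniform(3000, 4000) for _ in range(num_idcs)],
            "price": [rng.uniform(1, 10) for _ in range(num_idcs)],
            # DSJ demand is a fraction of capacity drawn from [0, 0.4].
            "dsj_demand": [c * rng.uniform(0.0, 0.4) for c in capacity],
        })
    return states


def sample_state(states: List[Dict[str, List[float]]], rng: random.Random,
                 weights: Optional[List[float]] = None) -> Dict[str, List[float]]:
    """Pick the state for one slot; weights=None means uniform (an assumption)."""
    return rng.choices(states, weights=weights, k=1)[0]
```

Calling `sample_state` once per slot with a fixed `weights` list gives the "predefined probability" over states; the average ratio of `dsj_demand` to `capacity` should then be close to 0.2.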




2) Simulation results: We first compute the Optimal Solution to the
optimization problem with the System distribution Information, which
is difficult to obtain in practice. We name it OSSI and compare it
with SSTF and QTF. First, from Fig. 1 we observe that the cost of
SSTF is very close to that of OSSI under different DTJ load ratios.
Their queue delays are also very close. Since we also account for
idle power consumption, i.e., $(1 - \rho)K_i^t$, and DSJ power
consumption, the costs of different schemes are very close when the
load of DTJs is low, such as with the ratios of 0.5 and 1, because
the impact of DTJs is small. To study the convergence of SSTF, we
also consider the DTJ power consumption separately. Results show
that SSTF and OSSI achieve very close performance in terms of cost
and delay. We do not plot these results here due to the page limit.
In Fig. 1, we consider QTF with $V = 1$ and $V = 1000$,
respectively. In both cases, QTF leads to a higher cost, but the
queue delay is significantly smaller than that of OSSI and SSTF.
QTF with $V = 1000$ has a slightly larger cost than OSSI and SSTF,
but a much smaller delay, even when the DTJ load is high, e.g., with
a ratio of 3.5. In this case, QTF with $V = 1$ has a very small
delay, i.e., almost 1, with a much higher cost. Thus, in practice,
one can tune the value of $V$ to obtain a desirable tradeoff between
cost and delay, especially when the load of DTJs is high.

In Fig. 1, the queue delay of OSSI and SSTF is very large, which
holds even when we set the average service rate (slightly) larger
than the arrival rate. To find the reasons, we examine the service
rates of a DTJ queue at different time resolutions. We first
consider the one-slot service rate, normalized by the average DTJ
arrival rate. We plot the rates over 100 slots in Fig. 2(a) and (b),
for SSTF and QTF, respectively (the service rate of OSSI is very
similar to that of SSTF). Note that in Fig. 2 the ratio of DTJ load
is 1 and $V$ for QTF is 1000. We observe that the rate assignment by
SSTF is quite even across slots: the DTJs receive a service rate in
every slot. Rate assignment by QTF is much more bursty: the service
rate is non-zero only once every several slots. This result is
consistent with Observation 1, where we show that capacity
allocation by QTF for a single IDC is a threshold-based policy based
on the queue length. In the time slots without service, jobs
accumulate and queue delay increases. This is the reason that there
is a queue delay of about 5 in Fig. 1 for QTF ($V = 1000$ and DTJ
load ratio of 1). Nevertheless, queue stability is guaranteed since
the service rates are fairly large every several slots, such that
the accumulated jobs can be finished. We also examine the rate at a
coarse time resolution, i.e., the average rate over every 1000 time
slots (normalized by the average arrival rate). We plot the results
in Fig. 2(c). An interesting observation is that in this case, the
rate of SSTF is more bursty than that of QTF. During the periods in
which the normalized service rates are lower than 1, DTJs accumulate
such that the queue length is fairly large in most slots. Although
jobs can be finished during periods when service rates are larger
than 1, significant delay cannot be avoided.

Fig. 3: Delay and cost by SSTF and QTF on real traffic trace
(x-axis: percentage of DTJs in total load (%)).
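
The two time resolutions used in this comparison, per-slot rates and averages over windows of 1000 slots, both normalized by the average arrival rate, can be computed as in the Python sketch below. The function name and the toy input are hypothetical; the simulated service rates themselves are not reproduced here.

```python
from typing import List, Sequence


def normalized_rates(service: Sequence[float], mean_arrival: float,
                     window: int = 1) -> List[float]:
    """Average service rate per window of slots, divided by the mean arrival rate.

    Trailing slots that do not fill a whole window are ignored.
    """
    out = []
    for start in range(0, len(service) - window + 1, window):
        chunk = service[start:start + window]
        out.append(sum(chunk) / (window * mean_arrival))
    return out
```

For a toy bursty series such as `[0, 0, 0, 4] * 250` with a mean arrival rate of 1, the per-slot view alternates between 0 and 4, while the 1000-slot view is flat at 1. This shows how a rate assignment can look bursty at one resolution and smooth at another.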

One can increase the average service rate of SSTF to obtain a
smaller delay, but much more capacity needs to be consumed, which
results in a much higher cost. In many cases when the load of DTJs
is high, there is little room for SSTF to increase service rates.
QTF can achieve different delay levels by tuning $V$. One important
property of QTF is that, regardless of whether $V$ is large or
small, the average service rate of QTF is always close to the
arrival rate, because it leverages the queue information. Thus QTF
provides a more efficient method for saving cost and reducing
delay. There are other findings, e.g., load shifting also plays an
important role in reducing cost and queue delay. Due to the page
limit, we omit them here.
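
The effect of $V$ can be reproduced qualitatively with a toy single-queue simulation. The Python sketch below is not the paper's experiment: it assumes one queue, a hypothetical linear cost `price * S`, a threshold rule of the form "serve when `2 * Q * r > V * price`", uniformly random prices and arrivals, and illustrative values for the unit rate and capacity.

```python
import random
from typing import Tuple


def simulate_qtf(V: float, slots: int = 20_000,
                 seed: int = 1) -> Tuple[float, float, float]:
    """Toy run; returns (avg_backlog, avg_cost_per_slot, served / arrived)."""
    rng = random.Random(seed)
    rate, cap = 1.0, 4.0  # hypothetical unit service rate and capacity
    q = backlog_sum = cost_sum = served = arrived = 0.0
    for _ in range(slots):
        price = rng.uniform(1.0, 10.0)
        # Linear-cost threshold rule: serve only if 2*Q*r exceeds V*price,
        # and never use more capacity than is needed to empty the queue.
        s = min(cap, q / rate) if 2.0 * q * rate > V * price else 0.0
        served += rate * s
        cost_sum += price * s
        arrival = rng.uniform(0.0, 2.0)  # mean arrival of 1 per slot
        arrived += arrival
        q = q - rate * s + arrival
        backlog_sum += q
    return backlog_sum / slots, cost_sum / slots, served / arrived
```

Comparing `simulate_qtf(1.0)` with `simulate_qtf(1000.0)` in this toy setting, the larger `V` should give a larger average backlog and a lower average cost, while the served traffic stays close to the arrived traffic once the initial build-up of the queue is over.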

B. Real trace based simulation 

In this subsection, we use a real datacenter traffic trace to study
the performance of SSTF and QTF. Our trace comes from a commercial
datacenter operated by a large cloud service provider in the U.S. We
obtain a Hadoop distributed file system (HDFS) log for one
datacenter for thirty days. The HDFS log records the information of
all received packets, including the packet size and time-stamp. The
original data does not differentiate DSJs and DTJs. (In fact,
differentiating such traffic without application-layer information
is itself a challenging issue in practical datacenter operations and
an active research topic.) To address this issue, we simply adopt a
threshold-based policy. We assume that a large packet is likely to
be delay tolerant, and treat a packet with a size larger than a
certain threshold as a DTJ. This classification is reasonable, as
the authors in [48] indicate that most Internet requests, such as
searching and web browsing, are only a few kb in size. We set the
threshold to 10, 50, 100, and 150 Mb to obtain different ratios
between DSJ load and DTJ load, which results in the percentage of
DTJ load in the total load being roughly 90%, 70%, 50%, and 10%,
respectively. Note that here we assume one unit (Mbit) of DSJs
requires one unit of capacity, and one unit of DTJs requires 0.133
unit of capacity on average, by the same rate setting as in the
above simulations (the average unit rate is roughly 7.5).
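
The threshold-based split and the capacity conversion can be sketched in Python as follows. The function name and the toy packet sizes are hypothetical; only the rule "larger than the threshold means DTJ" and the conversion factors (1 for DSJs, about 1/7.5 for DTJs) are taken from the text.

```python
from typing import Iterable, Tuple

DTJ_CAPACITY_PER_MBIT = 1.0 / 7.5  # about 0.133, from the average unit rate 7.5


def split_load(packet_sizes_mbit: Iterable[float],
               threshold_mbit: float) -> Tuple[float, float, float]:
    """Return (dsj_capacity, dtj_capacity, dtj_share_of_traffic)."""
    dsj = dtj = 0.0
    for size in packet_sizes_mbit:
        if size > threshold_mbit:
            dtj += size  # large packet: treated as delay tolerant
        else:
            dsj += size  # small packet: treated as delay sensitive
    total = dsj + dtj
    share = dtj / total if total else 0.0
    # One Mbit of DSJ traffic needs one unit of capacity.
    return dsj, dtj * DTJ_CAPACITY_PER_MBIT, share
```

Raising `threshold_mbit` moves traffic from the DTJ class to the DSJ class, which is how the different DTJ percentages are obtained from a single trace.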

To simulate multiple IDCs and multiple DTJ queues, we split twenty
days of large-packet traces into ten DTJ traffic traces, so that
each of them is a two-day traffic trace. We choose ten days of
small-packet traces as the DSJ demand for the five IDCs considered.
We use a time slot length of 20 seconds. Therefore we have 8640 time
slots for each two-day traffic trace.
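
The slot count follows from two days of 20-second slots: $2 \times 24 \times 3600 / 20 = 8640$. A minimal Python sketch of how time-stamped packets might be binned into such slots is given below; the `(timestamp, size)` input format and the function name are assumptions, since the log format is not described in detail.

```python
from typing import Iterable, List, Tuple

SLOT_SECONDS = 20
SLOTS_PER_TRACE = 2 * 24 * 3600 // SLOT_SECONDS  # two days of 20-second slots


def bin_into_slots(packets: Iterable[Tuple[float, float]]) -> List[float]:
    """Sum packet sizes per slot.

    Each packet is (seconds_since_trace_start, size); packets outside the
    two-day window are ignored.
    """
    load = [0.0] * SLOTS_PER_TRACE
    for timestamp, size in packets:
        slot = int(timestamp // SLOT_SECONDS)
        if 0 <= slot < SLOTS_PER_TRACE:
            load[slot] += size
    return load
```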

Further, we use the electricity price data of five wholesale market
regions on 02/22/2011: California (Hub SP 15-EZ), Louisiana
(Entergy), New England (NEPOOL Mass), Pennsylvania (PJM West), and
Texas (ERCOT SOUTH). The capacity is uniformly distributed between
1000 and 1200. The bandwidth constraint is uniformly distributed
between 1000 and 1500. The other settings are the same as in the
synthetic traffic case.

We compare SSTF and QTF to the best effort service scheme (BES). In
each slot, BES serves as much demand as possible for each DTJ queue,
in a best-effort fashion. When the available capacity in an IDC is
not enough to finish the current jobs, it shares the capacity
equally among all DTJ queues. In the simulation, we assume SSTF
knows the average DTJ arrival rate. The average service rate of SSTF
is set equal to the average DTJ arrival rate. The control variable
$V$ of QTF is set to 1000. We observe from Fig. 3 that for different
percentages of DTJ load, BES always leads to the highest cost, while
SSTF always has the lowest cost. The delay of SSTF is large, almost
5 hours. One reason is that it exploits temporal electricity price
diversity over a large time scale. One may think that BES would
result in the lowest delay. But in Fig. 3, the average delay of BES
is always larger than that of QTF. The reason is that load shifting
is not used in BES, so queues suffer large delays in an IDC with
less available capacity. This illustrates that load shifting is not
only necessary for reducing cost, but also important for exploiting
available capacity to improve delay performance. In summary, Fig. 3
shows that QTF is efficient in both saving cost and reducing delay.
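
The BES baseline, as described here, can be written in a few lines of Python. The sketch follows the description literally: serve everything if the capacity suffices, otherwise split the capacity equally. The function name is hypothetical, and redistributing any surplus from a queue that needs less than its equal share is not described in the text, so it is not modeled.

```python
from typing import List


def bes_allocate(demands: List[float], capacity: float) -> List[float]:
    """Capacity given to each DTJ queue of one IDC in one slot.

    demands[j] is the capacity queue j would need to finish its current jobs.
    """
    if sum(demands) <= capacity:
        return list(demands)  # enough capacity: serve everything
    share = capacity / len(demands)  # otherwise split the capacity equally
    return [share] * len(demands)
```

Because each IDC is handled on its own, nothing in this rule moves load toward an IDC with spare capacity, which is the limitation the comparison with QTF points to.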

It is also observed that as the percentage of DTJ load increases,
the total cost decreases and the average DTJ delay also decreases.
The reason is that when the DSJ load decreases, the total required
capacity decreases, as DSJs require more capacity per unit of
traffic. More capacity is thus available for DTJs, which leads to a
smaller DTJ delay and more room for energy saving.

IX. Conclusions 

In this paper, we study intelligent trough filling that achieves 
both energy efficiency and good delay performance. We design 
joint dynamic speed scaling and load shifting schemes. We 
first present a stochastic subgradient based trough filling 
algorithm, named SSTF, which solves a convex optimization 
problem for capacity allocation and load shifting in each slot. 
SSTF does not need information about the underlying distribution
of the system state. SSTF can converge to the optimal cost
under a given service rate constraint. We further propose a
queue-based trough filling algorithm, named QTF, which also 
solves a convex optimization problem for capacity allocation 
and load shifting in each slot. We show QTF can achieve 
optimal tradeoff between queue delay and cost. Our extensive 
simulations based on both synthetic and real datacenter traces 
show that SSTF achieves optimal cost, but has a large delay. 
QTF achieves both desirable cost and delay. In practice, SSTF 
can be applied to the scenario where DTJs can have a large 
time delay, e.g., half of a day. QTF can be applied to the case 
where smaller time delay is desirable, e.g., tens of minutes. 
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Appendix 

To prove Proposition 1, we first need Lemma 1.

Lemma 1: For the original optimization problem with $\lambda$
replaced by $\lambda + \mathbf{1}\epsilon$, the resulting optimal
cost $g^*(\epsilon)$ approaches $g^*$ as $\epsilon$ approaches 0.

Proof: We write the Lagrangian of the problem as
$L(S, \lambda, \mu)$, where $\mu$ denotes the Lagrange multipliers.
When $\lambda$ is replaced by $\lambda + \mathbf{1}\epsilon$, we have
$L(S, \lambda + \mathbf{1}\epsilon, \mu) \to L(S, \lambda, \mu)$ as
$\epsilon \to 0$. Since the problem is a convex optimization
problem, $g^*(\epsilon)$ approaches $g^*$ as $\epsilon$ approaches 0.

We next present the proof of Proposition 1.

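Proof: Consider the $M$ DTJ queues $Q(t) = (Q_1(t), \ldots, Q_M(t))$.
We introduce a non-negative Lyapunov function
$L(Q(t)) = \sum_{j=1}^{M} Q_j(t)^2$ and define the one-slot
conditional Lyapunov drift as

$$\Delta(t) = \mathbb{E}\big[ L(Q(t+1)) - L(Q(t)) \,\big|\, Q(t) \big].$$

Using the fact that $(\max[a - b, 0] + c)^2 \le a^2 + b^2 + c^2 + 2a(c - b)$
for any $a, b, c \ge 0$, with $a = Q_j(t)$,
$b = \sum_{i \in \Gamma_j} r_{ij} S_{ij}^t$, and $c = D_j^t$ (the
traffic of DTJ $j$ arriving in slot $t$), we have

$$Q_j(t+1)^2 - Q_j(t)^2 \le \Big(\sum_{i \in \Gamma_j} r_{ij} S_{ij}^t\Big)^2 + (D_j^t)^2 + 2 Q_j(t) \Big(D_j^t - \sum_{i \in \Gamma_j} r_{ij} S_{ij}^t\Big).$$

Based on this inequality, we further have

$$\Delta(t) \le \mathbb{E}\Big[\sum_{j=1}^{M} \Big(\sum_{i \in \Gamma_j} r_{ij} S_{ij}^t\Big)^2 + \sum_{j=1}^{M} (D_j^t)^2 \,\Big|\, Q(t)\Big] + 2\,\mathbb{E}\Big[\sum_{j=1}^{M} Q_j(t) \Big(D_j^t - \sum_{i \in \Gamma_j} r_{ij} S_{ij}^t\Big) \,\Big|\, Q(t)\Big]. \tag{30}$$

Note that $\sum_{j=1}^{M} \big(\sum_{i \in \Gamma_j} r_{ij} S_{ij}^t\big)^2$
is bounded by $\sum_{i \in \cup_j \Gamma_j} r_i^2 (K_i^{\max})^2$,
where $r_i = \max\{r_{ij} \mid j : i \in \Gamma_j\}$, so that
$r_i K_i^{\max}$ is the maximum service rate with full server
capacity. In each slot, we have also assumed that the arrival
traffic size of each DTJ $j$ is bounded by $D_j^{\max}$.

For brevity, here we define
$B = \sum_{i \in \cup_j \Gamma_j} r_i^2 (K_i^{\max})^2 + \sum_{j} (D_j^{\max})^2$.
Since the traffic of DTJs in each slot is independent of the queue
backlog $Q(t)$, we can rewrite (30) as

$$\Delta(t) \le B + 2 \sum_{j=1}^{M} Q_j(t) \lambda_j - 2\,\mathbb{E}\Big[\sum_{j=1}^{M} Q_j(t) \sum_{i \in \Gamma_j} r_{ij} S_{ij}^t \,\Big|\, Q(t)\Big]. \tag{31}$$

We consider the drift-plus-cost for the system, where the cost
results from QTF. The cost is the expected cost conditional on the
queue backlog in time slot $t$, which can be written as
$\mathbb{E}[g_q^t(S^t) \mid Q(t)]$. With $V$ as a control variable,
we have

$$\Delta(t) + V\,\mathbb{E}[g_q^t(S^t) \mid Q(t)] \le B + 2 \sum_{j=1}^{M} Q_j(t) \lambda_j - 2\,\mathbb{E}\Big[\sum_{j=1}^{M} Q_j(t) \sum_{i \in \Gamma_j} r_{ij} S_{ij}^t \,\Big|\, Q(t)\Big] + V\,\mathbb{E}[g_q^t(S^t) \mid Q(t)]. \tag{32}$$

By (32), we can see that QTF minimizes the right-hand side of the
drift-plus-cost bound in each time slot. Thus, comparing QTF with a
policy that attains $g^*(\epsilon)$ and serves each queue $j$ at an
average rate of at least $\lambda_j + \epsilon$, we have

$$2 \sum_{j=1}^{M} Q_j(t) \lambda_j + 2\,\mathbb{E}\Big[\frac{V}{2} g_q^t(S^t) - \sum_{j=1}^{M} Q_j(t) \sum_{i \in \Gamma_j} r_{ij} S_{ij}^t \,\Big|\, Q(t)\Big]$$
$$\le 2 \sum_{j=1}^{M} Q_j(t) \lambda_j + 2\,\mathbb{E}\Big[\frac{V}{2} g^*(\epsilon) - \sum_{j=1}^{M} Q_j(t) (\lambda_j + \epsilon) \,\Big|\, Q(t)\Big] = -2\epsilon \sum_{j=1}^{M} Q_j(t) + V g^*(\epsilon). \tag{33}$$

By (32) and (33), we have

$$\Delta(t) \le B - 2\epsilon \sum_{j=1}^{M} Q_j(t) + V g^*(\epsilon) - V\,\mathbb{E}[g_q^t(S^t) \mid Q(t)]. \tag{34}$$

Taking expectations of the drift $\Delta(t)$ with respect to the
distribution of the random queue backlog $Q(t)$ at time $t$, we have

$$\mathbb{E}\big[L(Q(t+1)) - L(Q(t))\big] \le B - 2\epsilon \sum_{j=1}^{M} \mathbb{E}[Q_j(t)] + V g^*(\epsilon) - V\,\mathbb{E}[g_q^t(S^t)]. \tag{35}$$

The above inequality holds for every time slot $t$. Summing over
time slots $t = 1, 2, \ldots, T$, we have

$$\mathbb{E}\big[L(Q(T+1)) - L(Q(1))\big] \le TB - 2\epsilon \sum_{t=1}^{T} \sum_{j=1}^{M} \mathbb{E}[Q_j(t)] + TV g^*(\epsilon) - V \sum_{t=1}^{T} \mathbb{E}[g_q^t(S^t)]. \tag{36}$$

By (36), since the Lyapunov function and the cost are non-negative,
we can get

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{M} \mathbb{E}[Q_j(t)] \le \frac{B + V g^*(\epsilon)}{2\epsilon} + \frac{\mathbb{E}[L(Q(1))]}{2T\epsilon}. \tag{37}$$

As $T \to \infty$, we have
$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{M} \mathbb{E}[Q_j(t)] \le \frac{B + V g^*(\epsilon)}{2\epsilon}$.
Thus the queue backlog is bounded and system stability holds.
Further, again by (36),

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[g_q^t(S^t)] \le g^*(\epsilon) + \frac{B}{V} + \frac{\mathbb{E}[L(Q(1))]}{TV}. \tag{38}$$

As $T \to \infty$, we have
$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[g_q^t(S^t)] \le g^*(\epsilon) + \frac{B}{V}$.
By Lemma 1, $g^*(\epsilon) \to g^*$ as $\epsilon$ approaches 0, and
the left-hand side of (38) is independent of $\epsilon$. Thus we
have
$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[g_q^t(S^t)] \le g^* + \frac{B}{V}$.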



